Speech-to-lip movement synthesis based on the EM algorithm using audio-visual HMMs
Authors
Abstract
This paper proposes a method to re-estimate output visual parameters for speech-to-lip movement synthesis using audio-visual Hidden Markov Models (HMMs) under the Expectation-Maximization (EM) algorithm. Among conventional methods for speech-to-lip movement synthesis is one that estimates a visual parameter sequence through Viterbi alignment of an input acoustic speech signal using audio HMMs. This HMM-Viterbi method produces output visual parameters per HMM state, as specified by the decoded state sequence. However, the HMM-Viterbi method has a fundamental problem caused by its deterministic decoding, which assigns a single HMM state to each input audio frame: the deterministic process may output incorrect visual parameters whenever the HMM state alignment is incorrect. The proposed method avoids this deterministic decoding by estimating the visual parameters non-deterministically with the EM algorithm, repeatedly re-estimating the visual parameter sequence while maximizing the likelihood of the audio-visual observation sequence under the audio-visual HMMs. An objective evaluation shows that the proposed method is more effective than the HMM-Viterbi method, especially for the bilabial consonants.
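As a concrete illustration of the difference, here is a minimal Python sketch, not the paper's exact formulation, of replacing a hard Viterbi state assignment with soft state posteriors computed by the forward-backward procedure. It assumes single-Gaussian, diagonal-covariance state emissions, and the parameter names (`mu_a`, `var_a`, `mu_v`, `log_pi`, `log_A`) are hypothetical placeholders; the visual trajectory is taken as the posterior-weighted combination of per-state visual means, whereas the proposed method additionally iterates such estimates under the EM algorithm to maximize the joint audio-visual likelihood.

```python
import numpy as np
from scipy.special import logsumexp

def forward_backward(log_b, log_pi, log_A):
    """State posteriors gamma[t, j] given per-frame log observation
    likelihoods log_b[t, j], initial log probs log_pi, and the log
    transition matrix log_A."""
    T, N = log_b.shape
    log_alpha = np.zeros((T, N))
    log_beta = np.zeros((T, N))
    log_alpha[0] = log_pi + log_b[0]
    for t in range(1, T):
        log_alpha[t] = log_b[t] + logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0)
    for t in range(T - 2, -1, -1):
        log_beta[t] = logsumexp(log_A + (log_b[t + 1] + log_beta[t + 1])[None, :], axis=1)
    log_gamma = log_alpha + log_beta
    log_gamma -= logsumexp(log_gamma, axis=1, keepdims=True)
    return np.exp(log_gamma)

def estimate_visual(audio, mu_a, var_a, mu_v, log_pi, log_A):
    """Soft state occupancies from the audio stream, then the visual
    trajectory as the posterior-weighted mix of state visual means."""
    diff = audio[:, None, :] - mu_a[None, :, :]           # (T, N, audio_dim)
    log_b = -0.5 * (np.sum(diff ** 2 / var_a[None], axis=2)
                    + np.sum(np.log(2 * np.pi * var_a), axis=1))
    gamma = forward_backward(log_b, log_pi, log_A)        # (T, N)
    return gamma @ mu_v                                   # (T, visual_dim)
```

The HMM-Viterbi baseline amounts to replacing `gamma` with a one-hot indicator of the single best state per frame; a frame aligned to the wrong state then contributes that state's visual mean wholesale, which is exactly the failure mode the soft estimate dampens.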
Similar Papers
Lip Motion Generation from Audio Signals based on Hidden Markov Models
Speech recognition and computer lipreading have been developed as computer input methods; it is also important to provide a natural and friendly interface as an output. Recently, there has been increasing interest in using both the auditory and visual modalities of speech processing. Especially in research on human perception, the effect of integrating the auditory and visual modalities has been investigated ...
Subjective Evaluation for HMM-Based Speech-To-Lip Movement Synthesis
An audio-visual intelligibility score is generally used as an evaluation measure in visual speech synthesis. In particular, the intelligibility score of a talking head reflects the accuracy of its facial model [1][2]. Facial modeling has two stages: construction of realistic faces and realization of dynamic, human-like motions. We focus on lip movement synthesis from input acoustic speech to realize d...
HMM-based text-to-audio-visual speech synthesis
This paper describes a technique for text-to-audio-visual speech synthesis based on hidden Markov models (HMMs), in which lip image sequences are modeled with an image- or pixel-based approach. To reduce the dimensionality of the visual speech feature space, we obtain a set of orthogonal vectors (eigenlips) by principal component analysis (PCA) and use a subset of the PCA coefficients and their dy...
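As a hedged illustration of the eigenlip construction described above, the sketch below derives orthogonal basis vectors from vectorized lip frames by PCA (via SVD) and projects each frame onto a leading subset; the image size, frame count, component count, and random stand-in data are all assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
lip_images = rng.random((500, 32 * 32))         # stand-in for vectorized lip frames

mean_lip = lip_images.mean(axis=0)
centered = lip_images - mean_lip

# SVD of the centered data yields the orthogonal "eigenlips".
_, _, vt = np.linalg.svd(centered, full_matrices=False)
k = 16                                          # illustrative subset size
eigenlips = vt[:k]                              # (k, 1024)

coeffs = centered @ eigenlips.T                 # low-dimensional visual features
deltas = np.gradient(coeffs, axis=0)            # simple dynamic (delta) features
reconstruction = coeffs @ eigenlips + mean_lip  # approximate lip images
```

The `coeffs` rows (optionally stacked with `deltas`) are the kind of compact visual feature vectors an HMM would be trained on.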
Visual Speech Synthesis Based on Parameter Generation From HMM: Speech-Driven and Text-And-Speech-Driven Approaches
This paper describes a technique for synthesizing synchronized lip movements from an input auditory speech signal. The technique is based on an algorithm for parameter generation from HMMs with dynamic features, which has been successfully applied to text-to-speech synthesis. Audio-visual speech unit HMMs, namely syllable HMMs, are trained with parameter vector sequences that represent both auditor...
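The parameter generation algorithm referred to here has a well-known closed form: with a window matrix W stacking static and delta windows, the maximum-likelihood static trajectory c solves (W^T Sigma^-1 W) c = W^T Sigma^-1 mu. The following is a minimal sketch under assumed simplifications (one feature dimension, diagonal covariances, a centered-difference delta window), not the authors' implementation.

```python
import numpy as np

def generate_trajectory(mu, var, T):
    """Solve (W^T P W) c = W^T P mu for the static trajectory c, where
    W stacks an identity window and a centered-difference delta window
    and P = Sigma^{-1} is diagonal. mu, var have shape (2T,): the first
    T entries are static means/variances, the last T are deltas."""
    W = np.zeros((2 * T, T))
    W[:T] = np.eye(T)
    for t in range(T):
        if t > 0:
            W[T + t, t - 1] = -0.5
        if t < T - 1:
            W[T + t, t + 1] = 0.5
    P = np.diag(1.0 / var)
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)
```

Real systems solve this per feature dimension with a banded Cholesky factorization, with the means and variances read off the decoded HMM state sequence; the dense solve above is only for clarity.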
متن کاملLip-reading from parametric lip contours for audio- visual speech recognition
This paper describes the incorporation of a visual lip-tracking and lip-reading algorithm that utilizes affine-invariant Fourier descriptors from parametric lip contours to improve audio-visual speech recognition systems. The audio-visual speech recognition system presented here uses parallel hidden Markov models (HMMs), where a joint decision, using an optimal decision rule, is made af...
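The "joint decision using an optimal decision rule" over parallel audio and visual HMMs is commonly realized as a weighted combination of per-stream log-likelihoods; the sketch below shows that common form with a fixed, hypothetical stream weight `lam` and made-up scores, standing in for whatever rule the paper actually derives.

```python
import numpy as np

def fuse(log_l_audio, log_l_video, lam=0.7):
    """Weighted product rule: score each candidate word model w by
    lam * log P(audio | w) + (1 - lam) * log P(video | w)."""
    return lam * log_l_audio + (1.0 - lam) * log_l_video

# Hypothetical per-word log-likelihoods from the two parallel HMM streams.
log_l_audio = np.array([-120.4, -118.9, -131.2])
log_l_video = np.array([-45.1, -47.8, -44.0])
best_word = int(np.argmax(fuse(log_l_audio, log_l_video)))
```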